



Character and Document Research in 
the Open Mind Initiative 

David G. Stork 
Ricoh Silicon Valley 
2882 Sand Hill Road #115 
Menlo Park, CA 94025-7022 
stork@rsv.ricoh.com



Abstract

We describe the Open Mind Initiative, a framework for large-scale
collaborative efforts in building components of "intelligent" systems
that address common-sense reasoning, document and language
understanding, speech and character recognition, and so on. Based on
the Open Source methodology, the Open Mind Initiative allows domain
specialists to contribute algorithms, tool developers to provide
software infrastructure and tools, and non-specialist "e-citizens" to
contribute training data and information to large databases. An
important challenge is to make it easy and rewarding for e-citizens to
provide such information. This paper illustrates the Initiative through
several demonstration projects of modest scale, including some related
to character and document problems, and identifies general challenges
and opportunities.

1 Introduction

"Intelligent" machines (ones that can understand speech, read
handwriting, recognize objects and actions from images or video clips,
summarize stories, reason about the world, engage in conversation, play
world-class chess or go) have been a dream and research goal for the
last half-century. Efforts to build such systems have relied on both
theory and data [11, 27], and there has been slow incremental
improvement in a number of areas. We argue that several disciplines
would profit from very large amounts of data and an open framework for
experimentation and collaboration, and point to a new methodology, the
Open Mind Initiative, to this end.

The paper is organized as follows: In Sect. 2 we contend that current
models are adequate or nearly adequate for many important pattern
recognition and knowledge engineering problems, and that it is the lack
of training data and an open framework for systems engineering and
integration that retards progress. Even in domains where the models may
not yet be adequate, large amounts of knowledge and training data are
needed to decide between competing models and thus to accelerate
progress. In Sect. 3 we show that such limitations may be overcome in
part by a new software methodology, Open Source, and the infrastructure
of the web. This leads to a description of the Open Mind Initiative in
Sect. 4. An important distinction with Open Source is the reliance of
Open Mind on infrastructure and tools and large numbers of relatively
untutored contributors. The Initiative is illustrated in Sect. 5 with
several proposed projects, including two from character and document
related areas. The paper concludes in Sect. 6 with some general remarks
and important challenges.

2 Models, tools and data 

All pattern recognition and intelligent systems 
rely on theory and models as well as a knowl- 
edge base or training data; such systems (par- 
ticularly fielded commercial ones) also rely on 
a great deal of software engineering. 

2.1 Models 

In very broad terms, recent work in many ar- 
eas of pattern recognition and artificial intelli- 
gence has relied increasingly upon fairly gen- 
eral models, such as powerful statistical ones, 
trained with a great deal of data. The funda- 
mental theoretical underpinnings of domain- 
independent pattern recognition — maximum- 
likelihood and Bayesian techniques, function 
estimation, and so on — are highly developed 
and rigorous. While there will continue to be 
effort and progress, the foundations as cur- 
rently understood are sufficient for developing 
successful pattern classifiers in many domains. 

The adequacy of even very simple models is illustrated in optical
character recognition, where accurate recognizers can use simple models
(decision trees, neural networks, ...) trained with millions of
characters. These outperform recognizers based on sophisticated models
trained with less data. As Ho and Baird conclude, "quality of training
data is the dominant factor affecting accuracy" and "as long as the
training data are representative and sufficiently many, a wide range of
classifier technologies can be trained to equally high accuracies"
[15], a point stressed elsewhere independently [28, 13]. The key lesson
from decades of work in isolated character recognition is this: Even a
weak (but general) model can yield excellent results if it is trained
with sufficient data.

Figure 1: General, non-specific models trained
with a large amount of data can outperform
better models trained with a small amount of
data (see text).

This fact is illustrated schematically in 
Fig. 1. The top figure shows the ground truth 
for a two-dimensional, two-category classifi-
cation task with two equal-variance circular
Gaussian prior distributions. The Bayes er- 
ror, Eb, is shown. The figure at the left shows 
excellent models, circular Gaussians, whose 
means and variances were trained by maxi- 
mum likelihood with a small number of points; 
the resulting decision boundary and test error 
are shown as well. The center figure shows a 
less-specific model, where each Gaussian can 
have arbitrary covariance matrix; these Gaus- 
sian models were trained with a large amount 
of data, however. The final classifier has a 
test error lower than that of the better model 
at the left. The figure at the right shows the 
decision boundary of a nearest-neighbor clas-
sifier trained by learning with queries, where
the points presented to users were generated as
mid-way between two points, one chosen ran- 
domly from each category. Because a large 
number of user-classified points are used as 
training data, even this simple classifier out- 
performs the better model at the left. In 
short, a very simple nearest-neighbor classifier 
trained interactively and a weak general model 
trained with a large amount of data outper- 
form an excellent model trained with a small 
amount of data. 
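
The comparison between the left and center panels can be sketched as a small simulation. This is only an illustration under assumed settings, not the experiment behind Fig. 1: the class means, the sample sizes and all function names below are hypothetical choices, and only the Python standard library is used. A circular Gaussian model is fit to a handful of points, a general (full-covariance) Gaussian model is fit to many points, and both are scored on the same test set.

```python
import math
import random

def sample(mean, n, rng):
    """Draw n points from a unit-variance circular Gaussian centered at mean."""
    return [(rng.gauss(mean[0], 1.0), rng.gauss(mean[1], 1.0)) for _ in range(n)]

def fit_mean(points):
    n = len(points)
    return sum(p[0] for p in points) / n, sum(p[1] for p in points) / n

def fit_circular(points):
    """Maximum-likelihood circular Gaussian: a mean and one shared variance."""
    mx, my = fit_mean(points)
    var = sum((x - mx) ** 2 + (y - my) ** 2 for x, y in points) / (2 * len(points))
    return (mx, my), (var, 0.0, var)

def fit_full(points):
    """Maximum-likelihood Gaussian with an arbitrary 2x2 covariance matrix."""
    mx, my = fit_mean(points)
    n = len(points)
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    return (mx, my), (sxx, sxy, syy)

def log_density(model, point):
    """Gaussian log density, up to an additive constant shared by all models."""
    (mx, my), (sxx, sxy, syy) = model
    det = sxx * syy - sxy * sxy
    dx, dy = point[0] - mx, point[1] - my
    maha = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return -0.5 * math.log(det) - 0.5 * maha

def error_rate(model_a, model_b, test_a, test_b):
    """Fraction of test points assigned to the wrong category (equal priors)."""
    wrong = sum(log_density(model_a, p) < log_density(model_b, p) for p in test_a)
    wrong += sum(log_density(model_b, p) < log_density(model_a, p) for p in test_b)
    return wrong / (len(test_a) + len(test_b))

def run_experiment(seed=0, n_small=3, n_large=5000, n_test=5000):
    """Return (small-sample error, large-sample error, Bayes error)."""
    rng = random.Random(seed)
    mean_a, mean_b = (-1.0, 0.0), (1.0, 0.0)  # assumed class means
    test_a, test_b = sample(mean_a, n_test, rng), sample(mean_b, n_test, rng)
    small = error_rate(fit_circular(sample(mean_a, n_small, rng)),
                       fit_circular(sample(mean_b, n_small, rng)), test_a, test_b)
    large = error_rate(fit_full(sample(mean_a, n_large, rng)),
                       fit_full(sample(mean_b, n_large, rng)), test_a, test_b)
    # Means two standard deviations apart give a Bayes error of Phi(-1).
    bayes = 0.5 * math.erfc(1.0 / math.sqrt(2.0))
    return small, large, bayes
```

With these assumed settings the large-sample error should sit close to the Bayes error, while the small-sample error varies considerably from seed to seed and is typically higher.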

This need for large training sets is a les- 
son that recurs in a number of domains, 
from acoustic speech recognition [19] and 
speechreading [33] to face recognition [23], ges- 
ture recognition [24], natural language pro- 
cessing [10], speech production [35] and oth- 
ers. Moreover, for areas where we may not 
yet have adequate models, we know how to 
broaden and improve classes of models — to 
include more degrees of freedom to account for 
sources of variation, to set parameters, and 
so on — given enough data. Again, in broad 
terms, it appears that after decades of work 
we now have (or know how to create) adequate 
models in some disciplines and that progress is 
retarded by the lack of adequately large knowl- 
edge bases and training data. 

2.2 Infrastructure and tools 

Another key component in building such sys- 
tems involves software tools. There are many 
commercial tools for developing speech and 
natural language systems which allow develop- 
ers to explore model parameters easily, specify 
grammars, lexicons, and so on. Some of these 
tools are provided free of charge by companies 
in order to promote the sale and use of their 
hardware, such as speech chips [34, 16]. An
important lesson here is that non-specialist de- 
velopers can create useful systems, given suf- 
ficiently good tools. 




2.3 Training data 

In some domains, the need for training data 
is addressed with synthetic data, in which 
raw sensed patterns are transformed in model- 
based ways to yield surrogate ones. For exam- 
ple, in optical character recognition synthetic 
data is constructed by automatically rotat- 
ing, warping, line thinning or thickening, and 
adding pixel noise to sensed patterns. Syn- 
thetic data is attractive as it is far simpler 
to obtain than an "equivalent" number of raw 
sensed patterns, though some controversy re- 
mains as to their relative merits. One great 
benefit of synthetic data is in learning with 
queries. Here synthetic patterns can be gen- 
erated "on demand" in informative regions of 
feature space, i.e., near decision boundaries. 
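
As a rough illustration of such model-based transformations, the sketch below derives surrogate patterns from one binary character bitmap by stroke thickening and pixel noise. The bitmap representation (a list of rows of 0/1 values), the function names and the default noise level are assumptions made for this example; rotation and warping are omitted for brevity.

```python
import random

def thicken(image):
    """Thicken strokes: a pixel is set if it or one of its 4-neighbors is set."""
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            if image[r][c]:
                for dr, dc in ((0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < height and 0 <= cc < width:
                        out[rr][cc] = 1
    return out

def add_pixel_noise(image, flip_prob, rng):
    """Flip each pixel independently with probability flip_prob."""
    return [[1 - px if rng.random() < flip_prob else px for px in row]
            for row in image]

def synthesize(image, count, rng, flip_prob=0.02):
    """Produce `count` surrogate patterns from one sensed pattern."""
    variants = []
    for _ in range(count):
        # Thicken about half of the variants, then add noise to every one.
        base = thicken(image) if rng.random() < 0.5 else [row[:] for row in image]
        variants.append(add_pixel_noise(base, flip_prob, rng))
    return variants
```

A call such as `synthesize(bitmap, 100, random.Random(0))` would yield one hundred noisy variants of a single scanned character.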

The appreciation of the need for large 
knowledge bases and training data has led 
to the construction of publicly available 
databases. The National Institute of Stan-
dards and Technology (NIST), the Linguistic
Data Consortium (LDC), the Center of Ex-
cellence for Document Analysis and Recogni-
tion (CEDAR), the Information Science Research
Institute (ISRI), the University of Washington 
CD-ROM and other sites have compiled large 
databases of training data related to language 
and documents. One of the largest comes from 
the Macrophone project, compiled by Texas 
Instruments, a collection of roughly 200,000 
utterances of free telephone speech by non- 
specialists, constrained by topic [6]. While 
these and other public databases have been 
vital to continued improvements in recogniz- 
ers, some of the best systems are trained with 
additional data, usually proprietary [7]. This 
need for data is not restricted to training data 
in the classical pattern recognition framework, 
but includes more general information, such 
as diseases and their medical symptoms [31], 
common sense facts about the world [21], syn- 
onyms and relations, etc. [12]. 






3 Open Source Model

Recent technical, social and economic developments suggest a new
approach to augmenting knowledge databases and to building intelligent
systems. A decade or two ago, none but a few eccentric visionaries
could imagine that large-scale highly reliable software could be
developed outside of corporate, government or university research labs.
Nevertheless, the Free Software Foundation and now particularly the
Open Source movement (which promotes software reliability and quality
by supporting independent peer review and rapid evolution of source
code distributed for free) have blossomed from curiosities to
significant trends. These trends have already influenced the world
software market (particularly in operating systems and internet
software) and show no sign of abating. Leading Open Source software
includes: Linux, a Unix-like operating system (with nearly 100 million
lines of code and over 10 million installed seats, built by roughly
100,000 contributors [26]); the Mozilla version of the Netscape
Navigator Web browser (several hundred thousand lines); SendMail, a
utility on virtually every Unix machine; and Apache, which runs over
half of the world's web servers. A particularly relevant example here
is the Open Source web indexing produced by Newhoo (purchased by
Netscape/AOL). In the Newhoo approach, numerous non-specialist
contributors propose keyword and index information about web pages;
this information is reviewed by volunteer referee/editors (currently
4600), collected and made available to all. The software infrastructure
and tools themselves were developed through the Open Source method.
While proprietary software generally improves logarithmically with
time, Open Source generally improves exponentially.

4 Open Mind Initiative

In light of these developments we can approach 
the creation of intelligent systems based in 
part on the Open Source model — the Open 
Mind Initiative. It is perhaps simplest to 
consider its three component functions — 
provided by domain experts, infrastructure 
and tool developers, and e-citizens — cor- 
responding roughly to the three headings in 
Sect. 2. Domain experts contribute libraries 
of fundamental algorithms; tool developers 
contribute and refine the enabling software; 
e-citizens contribute information and training 
data. Users with an interest and expertise in a 
particular domain such as speech, vision, lan- 
guage, common sense, and so forth could serve 
as reviewers or moderators. All this is possi- 
ble given the infrastructure of the internet and 
world wide web. 

As in Open Source, the Open Mind Ini- 
tiative would be only loosely structured, and 
there is no clear notion of hierarchy. Individ- 
uals might participate in several stages at dif- 
ferent times and each project will require a 
different level of effort on the various compo- 
nents. There would be a number of component 
projects, with varying amounts of interactions. 



4.1 Domain experts 

Experts in a specific area such as optical char- 
acter recognition or speech recognition will 
submit documented libraries of fundamental 
algorithms, and possibly representative train- 
ing sets, all in Open Source and freely avail- 
able to all. Much of this work has already been 
published in refereed journals. This approach 
extends the trend in academic publishing in
which algorithms and data are published in 
electronic form on the web. 




4.2 Tool developers 

Key components to the Initiative are pro-
vided by the infrastructure and tool develop-
ers; these components differ somewhat from
those in traditional software engineering and 
research. Aside from user interfaces, format 
conversion, and so on, software will have to 
detect significant errors or "outliers" in con-
tributed data for review and possible elimi-
nation. The distinct challenge, however, is
to make it easy for e-citizens to contribute 
data, and here new forms of infrastructure will 
have to be developed. Consider Animals, an 
interactive children's program from the late- 
1970s and early 1980s for classifying animals 
[30]. The child thinks of an animal, and the
program tries to guess it from the child's re-
sponses to a series of questions, such as ABLE
TO FLY (YES OR NO)?, TWO LEGGED OR
FOUR LEGGED?, and so forth. After the final
question, the program makes a guess, for in-
stance CHICKEN. If the guess is wrong (the
child was thinking instead of TURKEY), the
child must provide a new question that dis-
tinguishes her animal from the one guessed
by the program, e.g., RED COMB ON HEAD?
This new query is automatically incorporated 
into the program. After a number of children 
have played this guessing game, Animals has 
learned a simple tree-based classifier for ani- 
mals. We have written a Java-based program, 
Open Mind Animals, in order to develop tools 
and infrastructure useful in the Open Mind 
Initiative [20]. 
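
The way a wrong guess grows the tree can be sketched in a few lines. This is a minimal illustration written for this discussion, not the code of Animals or of Open Mind Animals; the `Node` class and `play_round` function are hypothetical names, and questions are restricted to yes/no answers for simplicity.

```python
class Node:
    """A yes/no question (internal node) or an animal guess (leaf)."""

    def __init__(self, text, yes=None, no=None):
        self.text = text
        self.yes = yes
        self.no = no

    def is_leaf(self):
        return self.yes is None and self.no is None


def play_round(root, animal, answers, new_question=None, answer_for_animal=None):
    """Play one game; return True if the program guessed `animal`.

    `answers` maps each question text to the player's True/False reply.
    After a wrong guess, the leaf is replaced by `new_question`, whose
    answer for the player's animal is `answer_for_animal`.
    """
    node = root
    while not node.is_leaf():
        node = node.yes if answers[node.text] else node.no
    if node.text == animal:
        return True
    if new_question is None:
        raise ValueError("a distinguishing question is needed after a wrong guess")
    # Turn the wrong leaf into a question separating the two animals.
    old_leaf, new_leaf = Node(node.text), Node(animal)
    node.text = new_question
    if answer_for_animal:
        node.yes, node.no = new_leaf, old_leaf
    else:
        node.yes, node.no = old_leaf, new_leaf
    return False
```

Starting from a single leaf `Node("chicken")`, a player thinking of a turkey who supplies "red comb on head?" (answer: no) leaves behind a two-leaf tree that guesses correctly the next time either animal is chosen.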

Similarly, the Answer Garden project has 
demonstrated that knowledge bases for help 
desk environments can be grown "organically" 
and semi-automatically [1, 2]. Perhaps new
and more sophisticated games, inspired by An-
imals and Answer Garden, could be part of the
Open Mind Initiative, and would encourage
large numbers of users to contribute with min-
imal burden. This basic approach is already
exploited in DirectHit, which refines web in-
dexes based on actual user search sequences.
We call this approach "unconscious knowledge 
capture." Here expertise in design, advertis- 
ing and marketing would be needed to com- 
plement more traditional software skills. 

In such a relatively unstructured massive 
collaborative software project many technical 
problems must be considered — everything 
from low-quality data to outright hostile at- 
tacks. A number of simple heuristics in data 
"truthing" could reduce the possibility of poor 
data. For instance, any query from the Open 
Mind system could be presented to three in- 
dependent, randomly chosen users, and the 
reply accepted if all three agree. Likewise, 
there are domain-dependent algorithms for au- 
tomatically identifying "outliers" — responses 
that differ drastically from the current consen- 
sus. Such outliers could be brought automat- 
ically to the attention of a moderator/referee 
for review. There are many techniques from
experimental psychology for ensuring the qual-
ity of the data, such as the insertion of a "catch
trial" which has only one plausible answer; an
incorrect answer on such a catch trial reveals
an unreliable contributor and invalidates his
or her recent submissions.
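
Two of these heuristics, unanimous agreement among three contributors and catch trials, can be combined as in the sketch below. The data layout and function names are assumptions made for illustration. In particular, "recent submissions" is simplified here to every submission in the batch being processed, and the random choice of the three contributors is assumed to have happened when the queries were issued.

```python
def accept_by_agreement(replies, required=3):
    """Return the common label if exactly `required` replies all agree, else None."""
    if len(replies) == required and len(set(replies)) == 1:
        return replies[0]
    return None

def unreliable_contributors(catch_replies, catch_answers):
    """Contributors who gave a wrong reply on at least one catch trial.

    catch_replies: {contributor: {trial_id: reply}}
    catch_answers: {trial_id: the single plausible answer}
    """
    return {who for who, given in catch_replies.items()
            if any(catch_answers[trial] != reply for trial, reply in given.items())}

def truth_items(submissions, catch_replies, catch_answers):
    """Keep only items whose three reliable replies are unanimous.

    submissions: {item_id: [(contributor, label), ...]}
    """
    bad = unreliable_contributors(catch_replies, catch_answers)
    accepted = {}
    for item, votes in submissions.items():
        # Replies from contributors who failed a catch trial are discarded.
        labels = [label for who, label in votes if who not in bad]
        result = accept_by_agreement(labels)
        if result is not None:
            accepted[item] = result
    return accepted
```

Items that lose a reply because a contributor was invalidated fall below three replies and are therefore not accepted; in practice they would be sent to another contributor or to a moderator.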

4.3 E-citizens 

The biggest difference between traditional 
Open Source and the Open Mind Initiative is 
the need for data provided by e-citizens. Such 
contributions have been made in other fields, 
and given the right circumstances should oc- 
cur in the Open Mind Initiative too. For in- 
stance, amateur astronomers discover comets 
and contribute measurements of variable stars; 
over 13,000 amateur ornithologists participate 
in bird counts annually; amateurs scoured in- 
numerable books for words to contribute to 
the massive Oxford English Dictionary project 
[25]; amateur paleontologists discover fossils
and promising excavation sites; over half of the
mollusks in public museums come from ama-
teurs [22]; and so forth. When NASA's High
Resolution Microwave Survey project (which 
included the Search for Extraterrestrial Intel- 
ligence) was canceled, the non-profit SETI In- 
stitute was founded and thousands of indi- 
viduals donated time on their personal com- 
puters to search snippets of radio telescope 
recordings for tell-tale signs of intelligent
life [32]. E-citizens might be motivated to
make a meaningful contribution to a research 
project such as the Open Mind Initiative, per- 
haps through schools. Even if a tiny percent- 
age of people with web access provide a small 
amount of information, the training sets can 
grow by three or four orders of magnitude. For 
these people there is a need to provide a clear, 
realistic, yet inspiring vision of the project and 
its value to science and society. 

While contributing e-citizens should be per- 
mitted to remain anonymous, others might 
be glad to see their name listed in the Open 
Mind Initiative website, perhaps ranked by
the amount of their submitted information that
was accepted into the system. It was found,
for example, that in a collaborative effort 
for developing documentation for field service, 
service professionals actually preferred com- 
munity recognition and improved reputation 
(through posting of their name) over mone- 
tary incentives, which were viewed as corrupt- 
ing the process [5]. A somewhat less lofty but 
probably effective inducement could be pro- 
vided by sales or service companies, who would 
give discounts or benefits (e.g., frequent flyer 
miles) proportional to the amount of infor- 
mation an e-citizen contributes. In this way, 
the participating companies would attract new 
customers. Likewise a lottery could be insti- 
tuted, in which each submission of data gives 
its contributor a small chance of winning a 
large prize. 

Regardless of what motivates someone to
contribute, all will want to see how the system
progresses, be it by some raw measure of the 
total amount of information contributed, or in 
the quantitative performance of one of its com- 
ponents such as recognition accuracy. Other, 
qualitative indications of progress would be 
useful too. Just as parents delight in watching 
the cognitive development of their child, so too 
would contributors be excited to see an Open 
Mind common-sense reasoning system develop 
its "understanding" of the world. 

5 Sample projects 

Here we sketch three proposals for projects 
that seem well suited to the Open Mind Initia- 
tive: handwritten character recognition, hand- 
written word recognition and domain knowl- 
edge organization. A fourth is already in de- 
velopment and will be reported elsewhere [20]. 

5.1 Isolated handwritten charac- 
ter recognition 

The task is the recognition of isolated hand- 
written characters, for instance displayed as 
8x8 gray level pixel images. The classifier 
could be very simple indeed, such as a ba-
sic decision tree [8] or a neural network; the
effort will center on infrastructure, tools and
the social and organizational problems. Sup- 
pose that while your web browser boots up, 
it displays an image of a handwritten charac- 
ter, along with two buttons each labeled by a 
category, e.g., "4" and "9." You click on the 
button corresponding to your perception of the 
image, and your reply is sent automatically to 
an Open Mind repository, a tiny contribution 
toward improved character recognizers. Train- 
ing could be efficient since the system would 
present to e-citizens only ambiguous patterns 
(i.e., most informative), a technique known as 
learning with queries [3, 4]. This technique
often provides a distinct advantage over tradi-
tional i.i.d. sampling in that data can be pro-
vided near decision boundaries (cf. right fig-
ure in Fig. 1). Other such informative pat-
terns could be generated by transforming raw
patterns (line thinning, rotation, skew, etc.). 
There should be an interface to enable very 
active contributors to quickly and easily pro- 
vide a large number of responses. With mil- 
lions of such replies (possibly in different con- 
texts), the system's recognizer would become 
accurate indeed. 
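
Selecting which patterns to show can be sketched as follows, assuming the current classifier exposes an estimate of the probability that a pattern belongs to the first of the two offered categories. The function names and the choice of "closest to 0.5" as the ambiguity measure are assumptions made for this illustration. The second function generates query points midway between randomly chosen members of the two categories, as described for the right panel of Fig. 1.

```python
import random

def most_ambiguous(patterns, prob_first_class, k):
    """Return the k patterns whose estimated posterior is closest to 0.5."""
    return sorted(patterns, key=lambda p: abs(prob_first_class(p) - 0.5))[:k]

def midpoint_queries(class_a, class_b, count, rng):
    """Generate query points midway between random pairs, one per category."""
    queries = []
    for _ in range(count):
        a, b = rng.choice(class_a), rng.choice(class_b)
        queries.append(tuple((x + y) / 2 for x, y in zip(a, b)))
    return queries
```

For instance, with posterior estimates of 0.1, 0.4, 0.9 and 0.7 for four images, `most_ambiguous(..., k=2)` would present the images scored 0.4 and 0.7 to e-citizens first.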

One method for "truthing" — detecting and 
reducing poor quality data, eliminating grossly 
misshapen patterns, and so forth — would be 
to present any individual character image to 
three independent, randomly chosen contrib- 
utors, and accept their result only if all three 
agree. Other, more sophisticated and auto- 
matic methods can be used to detect outliers 
for rejection [14], avoid statistical bias, and so 
on. 

Contributors might include a portion of 
the members of laboratories working in the 
general field, their home institutions and re-
lated communities (≈100), students in pattern
recognition or other related courses world-
wide (≈200), members of the Linux and Open
Source community (≈1000), and interested
e-citizens (≈200) made aware through short
notices on discussion or mailing lists, links 
from personal or lab home pages or broad pub- 
lic announcements. As mentioned above, con- 
tributors may be more motivated to contribute 
if they can see the total amount of data sub-
mitted as well as comparisons with state-of-
the-art classifiers as the Open Mind classifier
improves over time. 

5.2 Handwritten word recogni- 
tion 

A closely related problem would be recognition 
of handwritten words. Here the data would
come from a large database of scanned hand-
written documents automatically segmented
by current algorithms. The e-citizen contrib- 
utors are either given a choice of candidate 
words from a classifier [18] (presented as click- 
able buttons) or are asked to type in their tran- 
scription. Here too, only the most informative 
or ambiguous word images would be presented 
to the e-citizens, and nonsense or poorly seg- 
mented words would be marked for elimination 
by e-citizens. 

5.3 Knowledge engineering 

A particularly interesting and instructive 
project could be based on CYC, a pioneer- 
ing attempt to capture common-sense knowl- 
edge. For over a decade, roughly a dozen CYC 
knowledge engineers have been entering more 
than 400,000 assertions (or "rules") designed 
to capture a significant portion of our consen- 
sus knowledge about the world [21]. For in- 
stance, CYC knows that a mother is older than 
her son, that clouds are usually outdoors, and 
that a cup filled with wine will be rightside-up, 
not upside-down. As part of its training, CYC 
"reasons" about its information, searching for 
inconsistencies or ambiguities, which are then 
presented as questions to be answered by the 
knowledge engineers. These answers, along 
with further rules, are recompiled and the cy- 
cle repeats. Its designers believe that once 
CYC attains a large amount of common-sense 
information, it can then continue to learn by 
"reading" digital encyclopedias or books. 

In an Open Mind project inspired by CYC, 
if the questions were posed in a clear way to 
even a fraction of the web population, large 
amounts of common sense knowledge could be 
entered rapidly. Done carefully, this would 
provide confirming or disconfirming evidence 
for the CYC model of knowledge representa-
tion. Rather than begin with a project so am-
bitious as general common sense, however, it
would be more productive to limit the domain
of discourse, for instance to computers and 
software. Thus the system would learn "com- 
mon sense" knowledge, such as the need for 
programs to be compiled or interpreted, that 
early versions of code are often buggy, that a 
mouse is a peripheral, and so on. The domain 
experts would set the data structures and con- 
flict resolution algorithms so that the ontology 
of the domain can be captured. We can expect 
that many computer-savvy contributors would 
be motivated to provide such information and 
would be early users of any final system. A 
great challenge for the infrastructure providers 
would be to develop an interface, or possibly 
cast the data acquisition as a game. 

5.4 Future projects 

There are a number of projects that could take 
advantage of an Open Mind Initiative, vary- 
ing in the scope, difficulty, scientific or prac- 
tical usefulness, and so on. Since many OCR 
systems based solely on pixel images asymp- 
tote at a fairly high but not perfect accuracy 
(e.g., 95%), there is a clear need for algorithms 
for subsequent stages. Grammar, syntax, con- 
text and topic identification algorithms, de- 
veloped through Open Mind, would be valu- 
able here. While of modest scientific value, 
an Open Mind effort to develop an entry to 
the Loebner Prize ("Turing test") for the most 
"humanlike" dialog system would present an 
interesting challenge to interface and database 
design. For instance, consider a Dungeons and 
Dragons game in which the goal is to nav- 
igate through a castle to the "human" king 
while avoiding the "robot" king. In each room, 
you are presented with two short paragraphs, 
each "written" by one of the kings (i.e., by 
natural language text generation algorithms 
with different parameters). You then click on 
the paragraph that seems most "humanlike." 
Each choice is used to refine the computer
models of word frequency, sentence and phrase
structure, and so on, to improve the model of 
human-like language. 

Computer chess systems rely on massive 
parallel search guided by numerical scores of 
each board position. Some systems, such as 
IBM's Deep Blue, stress search over sophis- 
ticated board scoring [17]. An Open Mind 
chess project would allow interested amateurs 
to rate a large number of board positions and 
thereby improve the system's beam search. 
This might lead to a more "human" style 
of chess play. The reliability of a contribu-
tor might be based on public chess ratings
or performance on an on-screen test. Be-
cause the branching factor for search in the Chi-
nese/Japanese board game go is so high, go is
unlikely to succumb to brute force approaches. 
Perhaps Open Mind Go, relying on a large 
number of board scores contributed by go 
players, would provide the first serious chal- 
lenge to human go masters. 

Building upon lessons learned from the 
knowledge engineering project described in 
Sect. 5.3, we can expect analogous projects in 
domains such as finance, office systems, sports, 
and so forth. One of the greatest benefits 
of the open nature of the Initiative is that 
it would facilitate the integration of different 
projects. Thus, speech recognition could be 
integrated with natural language and common 
sense databases for improved human-machine 
interfaces, smarter web searching, and so on. 

5.5 Business and legal issues 

Just as economic and commercial cases can 
be made for Open Source, so too can they be 
made for the Open Mind Initiative. Commer- 
cial firms could supply the resulting software 
directly or provide customization and support, 
as does Red Hat Software, Inc., for Linux. 
Hardware manufacturers could build devices 
that run Open Mind software, thereby reduc-
ing their software costs. Perhaps the most
important benefit would be to open market 
niches (e.g., for common sense software) that 
few if any individual commercial software com- 
panies could provide economically. 

Legal matters would have to be consid-
ered in order to avoid the intellectual property dis-
putes that have plagued big science projects
such as the Human Genome Project. Perhaps 
since the data was freely contributed, it should 
be free for use, even in commercial settings. 
Thus a licensing agreement modeled on that 
for Berkeley FreeBSD might be appropriate. 

6 Related work 

There are several important differences be-
tween Open Source/Free Software and the pro-
posed Open Mind Initiative, as shown in Ta-
ble 1.



Open Source                         Open Mind
----------------------------------  ----------------------------------
no e-citizens                       e-citizens crucial
expert knowledge                    informal knowledge
machine learning irrelevant         machine learning essential
web optional                        web essential
most work is directly on the        most work is not on the
  final software                      final software
hacker/programmer culture (10^?)    e-citizen/business culture (10^?)
separate functions contributed      single function goal
  (e.g., Linux device drivers)        (e.g., OCR recognition)

Table 1: Comparison of Open Source and
Open Mind approaches.

In traditional Open Source each contributor typically provides code
which addresses a different functionality, for instance device drivers
and routines for handling specific file formats in Linux. In Open Mind,
however, the goal is generally a single function, such as accurate
speech recognition. This distinction implies a somewhat different role
for oversight in the two cases. In Open Mind, moderators should monitor
the improvement on the single goal, rather than just the compatibility
of a proposed addition, as in Open Source.

Table 2 compares traditional data mining over the web with the Open
Mind approach. One significant difference here is that the needed data
might be unavailable in a data mining application; for instance, OCR
data is not available by data mining the web itself.

Data Mining                         Open Mind
----------------------------------  ----------------------------------
needed data might be unavailable    data tailored to the project
                                      (e.g., OCR)
no queries                          interactive queries
ambiguities may never be resolved   ambiguities may be resolved
                                      by queries
relatively fixed amount of data     new data encouraged (feedback loop)
slow learning                       faster learning
little or no e-citizen support      e-citizen support essential

Table 2: Comparison of Data Mining and
Open Mind approaches.



7 Big science 

Physics has had its atom smashers, Aeronau- 
tics and Astronautics its space missions, Mi- 
crobiology its Human Genome Project. It is 
time for cognitive science, computer science, 
pattern recognition, artificial intelligence and 
related fields to have our big science — one 
based on an Open Source model in which all
citizens can contribute a bit of their percep- 
tion and understanding of the world. Open 
Mind would cost a fraction of other big science 
projects yet be supremely useful. The only
"big" computer science project of this general
nature (outside the military and national secu-
rity area [9]) is the Digital Libraries Initiative
[29], which addresses explicit formal knowl-
edge. In contrast, the Open Mind Initiative
would capture the implicit informal knowledge 
that all people have. 

Given the conjunction of several forces — 
the need for applications such as natural 
human-machine interfaces and improved web 
searching, the existence of good theory, good 
infrastructure, demonstrated success of the 
Open Source methodology — the time seems 
right for projects in the Open Mind frame- 
work. Lessons learned from the projects de- 
scribed in Sect. 5 will be invaluable for further 
progress. 

A plausible model of the human brain is a collection of modules, each
evolved, trained and selected to address a small set of tasks. In a
loose, metaphorical sense, the Open Mind Initiative would build a
"brain" in an analogous way.

Acknowledgements 

The author would like to acknowledge discussions and correspondence
with Marko Balabanovic, Mindy Bokser, Ron Cole, Jonathan Hull, David
Israel, Yann LeCun, George Nagy, Eric Raymond, and Richard Schwartz.
The views expressed here are those of the author and not necessarily
those of the people just listed.



References 

[1] Mark S. Ackerman and Thomas W. Malone. Answer Garden: A tool for
growing organizational memory. In Proceedings of the Conference on
Office Automation Systems, Filtering, Querying, and Navigating,
pages 31-39, New York, NY, 1990. ACM Press.

[2] Mark S. Ackerman and David W. McDonald. Answer Garden 2: Merging
organizational memory with collaborative help. In Proceedings of the
ACM 1996 Conference on Computer Supported Work, pages 16-20, New
York, NY, 1996. ACM Press.

[3] Dana Angluin. Queries and concept learning. Machine Learning,
2(4):319-342, 1988.

[4] Dana C. Angluin. Learning with queries. In Eric B. Baum, editor,
Computational Learning and Cognition, pages 1-28, Philadelphia, PA,
1993. SIAM.

[5] David G. Bell, Daniel G. Bobrow, Olivier Raiman, and Mark H.
Shirley. Dynamic documents and situated processes: Building on local
knowledge in field service. In Toshiro Wakayama, Srikanth Kannapan,
Chan Meng Khoong, Shamkant Navathe, and JoAnne Yates, editors,
Information and Process Integration in Enterprises: Rethinking
Documents, pages 261-276, Boston, MA, 1998. Kluwer Academic
Publishers.

[6] Jared Bernstein, Kelsey Taussig, and Jack Godfrey. Macrophone: An
American English telephone speech corpus for the Polyphone project.
In Proceedings of the International Conference on Automatic Speech
and Signal Processing (ICASSP94), volume I, pages 81-84, Adelaide,
Australia, 1994.

[7] Mindy Bokser, 1999. Personal communication (Caere Corporation).

[8] Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J.
Stone. Classification and regression trees. Chapman & Hall, New York,
NY, 1993.

[9] Paul Cohen, Robert Schrag, Eric Jones, Adam Pease, Albert Lin,
Barbara Starr, David Gunning, and Murray Burke. The DARPA
high-performance knowledge bases project. AI Magazine, 19(4):25-49,
1998.

[10] Walter Daelemans, Antal van den Bosch, Jakub Zavrel, Jorn
Veenstra, Sabine Buchholz, and Bertjan Busser. Rapid development of
NLP modules with memory-based learning. In Roberto Basili and Maria
Theresa Pazienza, editors, ECML98 TANLPS Workshop Notes, pages 1-17,
Technische Universität Chemnitz, 1998.

[11] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern
Classification. Wiley, New York, NY, second edition, 2000.

[12] Christiane Fellbaum, editor. WordNet: An Electronic Lexical
Database. MIT Press, Cambridge, MA, 1998.

[13] Isabelle Guyon, John Makhoul, Richard Schwartz, and Vladimir
Vapnik. What size test set gives good error rate estimates? IEEE
Pattern Analysis and Machine Intelligence, PAMI-20(1):52-64, 1998.

[14] Thien M. Ha. Efficient detection of abnormalities in large OCR
databases. In Proceedings of the International Conference on Document
Analysis and Recognition (ICDAR97), volume 2, pages 1006-1010, Los
Alamitos, CA, 1997. IEEE Press.

[15] Tin Kam Ho and Henry S. Baird. Large-scale simulation studies in
pattern recognition. IEEE Transactions on Pattern Analysis and
Machine Intelligence, PAMI-19(10):1067-1079, 1997.

[16] Jerry R. Hobbs, Douglas Appelt, John Bear, and David Israel.
FASTUS: A system for extracting information from text. In Proceedings
of the ARPA Human Language Technology Workshop '93, pages 133-137,
Princeton, NJ, 1994. Distributed as Human Language Technology by
Morgan Kaufmann Publishers, San Mateo, CA.

[17] Feng-hsiung Hsu, Thomas Anantharaman, Murray Campbell, and
Andreas Nowatzyk. A grandmaster-level chess machine. Scientific
American, 263(4):44-50, 1990.

[18] Jonathan H. Hull, Tin Kam Ho, John Favata, Venu Govindaraju, and
Sargur N. Srihari. Combination of segmentation-based and wholistic
handwritten word recognition algorithms. In Sebastiano Impedovo and
Jean-Claude Simon, editors, From Pixels to Features III: Frontiers in
Handwriting Recognition, pages 261-272, New York, NY, 1992.
Elsevier-North Holland.

[19] Frederick Jelinek. Statistical Methods for Speech Recognition.
MIT Press, Cambridge, MA, 1998.

[20] Chuck Lam and David G. Stork. Open Mind Animals, 1999. Java
program and documentation.

[21] Doug B. Lenat. CYC: A large-scale investment in knowledge
infrastructure. Communications of the ACM, 38(11):33-38, 1995.

[22] James H. McLean, 1999. Personal communication from the Los
Angeles County Museum of Natural History.

[23] Baback Moghaddam, Tony Jebara, and Alex Pentland. Bayesian
modeling of facial similarity. In Advances in Neural Information
Processing Systems, volume 10. MIT Press, Cambridge, MA, 1998.






[24] Baback Moghaddam and Alex Pentland. Maximum-likelihood detection
of faces and hands. In Proceedings of the International Workshop on
Automatic Face- and Gesture-Recognition, pages 122-128, Zurich,
Switzerland, 1995.

[25] K. M. Elisabeth Murray. Caught in the Web of Words: James A.
Murray and the Oxford English Dictionary. Yale University Press, New
Haven, CT, reprint edition, 1995.

[26] Eric S. Raymond, 1999. Personal communication from
OpenSource.org.

[27] Stuart Russell and Peter Norvig. Artificial Intelligence: A
Modern Approach. Prentice Hall Series in Artificial Intelligence.
Prentice Hall, Englewood Cliffs, NJ, 1995.

[28] Michael Sabourin, Amar Mitiche, Danny Thomas, and George Nagy.
Hand-printed digit recognition using nearest neighbors classifiers.
In Proceedings of the Second Annual Symposium on Document Analysis
and Information Retrieval, pages 397-409, Las Vegas, NV, 1993.

[29] Bruce Schatz and Hsinchun Chen. Building large-scale digital
libraries. IEEE Computer, 29(5):22-26, 1996.

[30] Stuart C. Shapiro, January 1982. Programming Project 1: Animal
Program (20 Questions), "Introduction to Artificial Intelligence,"
Department of Computer Science, State University of New York at
Buffalo.

[31] Edward H. Shortliffe. Computer-Based Medical Consultations:
MYCIN. Elsevier/North-Holland, New York, NY, 1976.

[32] Seth Shostak. Sharing the Universe: Perspectives on
Extraterrestrial Life. Berkeley Hills, Berkeley, CA, 1998.

[33] David G. Stork and Marcus E. Hennecke, editors. Speechreading by
Humans and Machines: Models, Systems, and Applications. NATO Advanced
Studies Institute. Springer, New York, NY, 1996.

[34] Stephen Sutton, Ron A. Cole, Jacques de Villiers, Johan
Schalkwyk, Pieter Vermeulen, Michael Macon, Yonghon Yan, Ed Kaiser,
Brian Rundle, Kal Shobaki, Peter Hosom, Alex Kain, Johan Wouters,
Dominic Massaro, and Michael Cohen. Universal speech tools: The CSLU
Toolkit. In Proceedings of the International Conference on Spoken
Language Processing (ICSLP98), pages 3211-3224, Sydney, Australia,
November 1998.

[35] Antal van den Bosch and Walter Daelemans. Data-oriented methods
for grapheme-to-phoneme conversion. In Proceedings of the Sixth
Conference of the European Chapter of the ACL, pages 45-53. ACL,
1993.






Character and document research in the Open Mind Initiative

Stork, D.G.
Ricoh Silicon Valley, Menlo Park, CA

This paper appears in: Document Analysis and Recognition, 1999. ICDAR
'99. Proceedings of the Fifth International Conference on
Meeting Date: 09/20/1999 - 09/22/1999
Publication Date: 20-22 Sep 1999
Location: Bangalore, India
On page(s): 1-12
References Cited: 35
IEEE Catalog Number: PR00318
Number of Pages: xxiv+821
INSPEC Accession Number: 6352780



Abstract:

We describe the Open Mind Initiative, a framework for large-scale collaborative
efforts in building components of "intelligent" systems that address
common-sense reasoning, document and language understanding, speech and
character recognition, and so on. Based on the Open Source methodology, the
Open Mind Initiative allows domain specialists to contribute algorithms, tool
developers to provide software infrastructure and tools, and non-specialist
"e-citizens" to contribute training data and information to large databases. An
important challenge is to make it easy and rewarding for e-citizens to provide
such information. The paper illustrates the Initiative through several
demonstration projects of modest scale, including some related to character
and document problems, and identifies general challenges and opportunities.

Index Terms:

character recognition; common-sense reasoning; database management systems;
document handling; knowledge based systems; speech recognition; Open Mind
Initiative; Open Source methodology; character recognition; common sense
reasoning; document problems; document research; domain specialists;
e-citizens; intelligent systems; language understanding; large databases;
large scale collaborative efforts; software infrastructure; tool developers;
training data






